<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
  <meta charset="utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  
  <meta name="author" content="PG-Strom Development Team">
  <link rel="shortcut icon" href="../img/favicon.ico">
  <title>GPUDirect SQL - PG-Strom Manual</title>
  <link href='https://fonts.googleapis.com/css?family=Lato:400,700|Roboto+Slab:400,700|Inconsolata:400,700' rel='stylesheet' type='text/css'>

  <link rel="stylesheet" href="../css/theme.css" type="text/css" />
  <link rel="stylesheet" href="../css/theme_extra.css" type="text/css" />
  <link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/styles/github.min.css">
  <link href="//fonts.googleapis.com/earlyaccess/notosansjp.css" rel="stylesheet">
  <link href="//fonts.googleapis.com/css?family=Open+Sans:600,800" rel="stylesheet">
  <link href="../custom.css" rel="stylesheet">
  
  <script>
    // Current page data
    var mkdocs_page_name = "GPUDirect SQL";
    var mkdocs_page_input_path = "ssd2gpu.md";
    var mkdocs_page_url = null;
  </script>
  
  <script src="../js/jquery-2.1.1.min.js" defer></script>
  <script src="../js/modernizr-2.8.3.min.js" defer></script>
  <script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/highlight.min.js"></script>
  <script>hljs.initHighlightingOnLoad();</script> 
  
</head>

<body class="wy-body-for-nav" role="document">

  <div class="wy-grid-for-nav">

    
    <nav data-toggle="wy-nav-shift" class="wy-nav-side stickynav">
      <div class="wy-side-nav-search">
        <a href=".." class="icon icon-home"> PG-Strom Manual</a>
        <div role="search">
  <form id ="rtd-search-form" class="wy-form" action="../search.html" method="get">
    <input type="text" name="q" placeholder="Search docs" />
  </form>
  [<a href="../ja/ssd2gpu/" style="color: #cccccc">Japanese</a> | <strong>English</strong>]
</div>
      </div>

      <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
	<ul class="current">
	  
          
            <li class="toctree-l1">
		
    <a class="" href="..">Home</a>
	    </li>
          
            <li class="toctree-l1">
		
    <a class="" href="../install/">Install</a>
	    </li>
          
            <li class="toctree-l1">
		
    <span class="caption-text">Tutorial</span>
    <ul class="subnav">
                <li class="">
                    
    <a class="" href="../operations/">Basic Operations</a>
                </li>
                <li class="">
                    
    <a class="" href="../sys_admin/">System Administration</a>
                </li>
                <li class="">
                    
    <a class="" href="../brin/">BRIN Index</a>
                </li>
                <li class="">
                    
    <a class="" href="../partition/">Partitioning</a>
                </li>
                <li class="">
                    
    <a class="" href="../postgis/">PostGIS</a>
                </li>
                <li class="">
                    
    <a class="" href="../troubles/">Trouble Shooting</a>
                </li>
    </ul>
	    </li>
          
            <li class="toctree-l1">
		
    <span class="caption-text">Advanced Features</span>
    <ul class="subnav">
                <li class=" current">
                    
    <a class="current" href="./">GPUDirect SQL</a>
    <ul class="subnav">
            
    <li class="toctree-l3"><a href="#overview">Overview</a></li>
    

    <li class="toctree-l3"><a href="#system-setup">System Setup</a></li>
    
        <ul>
        
            <li><a class="toctree-l4" href="#driver-installation">Driver Installation</a></li>
        
            <li><a class="toctree-l4" href="#designing-tablespace">Designing Tablespace</a></li>
        
            <li><a class="toctree-l4" href="#distance-between-gpu-and-nvme-ssd">Distance between GPU and NVME-SSD</a></li>
        
        </ul>
    

    <li class="toctree-l3"><a href="#operations">Operations</a></li>
    
        <ul>
        
            <li><a class="toctree-l4" href="#controls-using-guc-parameters">Controls using GUC parameters</a></li>
        
            <li><a class="toctree-l4" href="#attension-for-visibility-map">Attension for visibility map</a></li>
        
        </ul>
    

    </ul>
                </li>
                <li class="">
                    
    <a class="" href="../arrow_fdw/">Apache Arrow</a>
                </li>
                <li class="">
                    
    <a class="" href="../gstore_fdw/">GPU Memory Store</a>
                </li>
                <li class="">
                    
    <a class="" href="../python/">Python cooperation</a>
                </li>
    </ul>
	    </li>
          
            <li class="toctree-l1">
		
    <span class="caption-text">References</span>
    <ul class="subnav">
                <li class="">
                    
    <a class="" href="../ref_types/">Data Types</a>
                </li>
                <li class="">
                    
    <a class="" href="../ref_devfuncs/">Functions and Operators</a>
                </li>
                <li class="">
                    
    <a class="" href="../ref_sqlfuncs/">SQL Objects</a>
                </li>
                <li class="">
                    
    <a class="" href="../ref_params/">GUC Parameters</a>
                </li>
    </ul>
	    </li>
          
            <li class="toctree-l1">
		
    <a class="" href="../release_note/">Release Note</a>
	    </li>
          
        </ul>
      </div>
      &nbsp;
    </nav>

    <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">

      
      <nav class="wy-nav-top" role="navigation" aria-label="top navigation">
        <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
        <a href="..">PG-Strom Manual</a>
      </nav>

      
      <div class="wy-nav-content">
        <div class="rst-content">
          <div role="navigation" aria-label="breadcrumbs navigation">
  <ul class="wy-breadcrumbs">
    <li><a href="..">Docs</a> &raquo;</li>
    
      
        
          <li>Advanced Features &raquo;</li>
        
      
    
    <li>GPUDirect SQL</li>
    <li class="wy-breadcrumbs-aside">
      
    </li>
  </ul>
  <hr/>
</div>
          <div role="main">
            <div class="section">
              
                <h1>SSD-to-GPU Direct SQL Execution</h1>

<h1 id="overview">Overview</h1>
<p>For the fast execution of SQL workloads, it needs to provide processors rapid data stream from storage or memory, in addition to processor's execution efficiency. Processor will run idle if data stream would not be delivered.</p>
<p>SSD-to-GPU Direct SQL Execution directly connects NVMe-SSD which enables high-speed I/O processing by direct attach to the PCIe bus and GPU device that is also attached on the same PCIe bus, and runs SQL workloads very high speed by supplying data stream close to the wired speed of the hardware.</p>
<p>Usually, PostgreSQL data blocks on the storage shall be once loaded to CPU/RAM through the PCIe bus, then, PostgreSQL runs WHERE-clause for filtering or JOIN/GROUP BY according to the query execution plan. Due to the characteristics of analytic workloads, the amount of result data set is much smaller than the source data set. For example, it is not rare case to read billions rows but output just hundreds rows after the aggregation operations with GROUP BY.</p>
<p>In the other words, we consume bandwidth of the PCIe bus to move junk data, however, we cannot determine whether rows are necessary or not prior to the evaluation by SQL workloads on CPU. So, it is not avoidable restriction in usual implementation.</p>
<p><img alt="SSD2GPU Direct SQL Execution Overview" src="../img/ssd2gpu-overview.png" /></p>
<p>SSD-to-GPU Direct SQL Execution changes the flow to read blocks from the storage sequentially. It directly loads data blocks to GPU using peer-to-peer DMA over PCIe bus, then runs SQL workloads on GPU device to reduce number of rows to be processed by CPU. In other words, it utilizes GPU as a pre-processor of SQL which locates in the middle of the storage and CPU/RAM for reduction of CPU's load, then tries to accelerate I/O processing in the results.</p>
<p>This feature internally uses NVIDIA GPUDirect RDMA. It allows peer-to-peer data transfer over PCIe bus between GPU device memory and third parth device by coordination using a custom Linux kernel module.
So, this feature requires NVMe-Strom driver which is a Linux kernel module in addition to PG-Strom which is a PostgreSQL extension module.</p>
<p>Also note that this feature supports only NVMe-SSD. It does not support SAS or SATA SSD.
We have tested several NVMe-SSD models. You can refer <a href="https://github.com/heterodb/pg-strom/wiki/002:-HW-Validation-List#nvme-ssd-validation-list">002: HW Validation List</a> for your information.</p>
<h1 id="system-setup">System Setup</h1>
<h2 id="driver-installation">Driver Installation</h2>
<p><code>nvme_strom</code> package is required to activate SSD-to-GPU Direct SQL Execution. This package contains a custom Linux kernel module which intermediates P2P DMA from NVME-SSD to GPU. You can obtain the package from the <a href="https://heterodb.github.io/swdc/">HeteroDB Software Distribution Center</a>.</p>
<p>If <code>heterodb-swdc</code> package is already installed, you can install the package by <code>yum</code> command.</p>
<pre><code>$ sudo yum install nvme_strom
            :
================================================================================
 Package             Arch            Version            Repository         Size
================================================================================
Installing:
 nvme_strom          x86_64          0.8-1.el7          heterodb          178 k

Transaction Summary
================================================================================
Install  1 Package
            :
DKMS: install completed.
  Verifying  : nvme_strom-0.8-1.el7.x86_64                                  1/1

Installed:
  nvme_strom.x86_64 0:0.8-1.el7

Complete!
</code></pre>

<p>Once <code>nvme_strom</code> package gets installed, you can see <code>nvme_strom</code> module using <code>lsmod</code> command below.</p>
<pre><code>$ lsmod | grep nvme
nvme_strom             12625  0
nvme                   27722  4
nvme_core              52964  9 nvme
</code></pre>

<h2 id="designing-tablespace">Designing Tablespace</h2>
<p>SSD-to-GPU Direct SQL Execution shall be invoked in the following case.</p>
<ul>
<li>The target table to be scanned locates on the partition being consist of NVMe-SSD.<ul>
<li><code>/dev/nvmeXXXX</code> block device, or md-raid0 volume which consists of NVMe-SSDs only.</li>
</ul>
</li>
<li>The target table size is larger than <code>pg_strom.nvme_strom_threshold</code>.<ul>
<li>You can adjust this configuration. Its default is physical RAM size of the system plus 1/3 of <code>shared_buffers</code> configuration.</li>
</ul>
</li>
</ul>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>Striped read from multiple NVMe-SSD using md-raid0 requires the enterprise subscription provided by HeteroDB,Inc.</p>
</div>
<p>In order to deploy the tables on the partition consists of NVMe-SSD, you can use the tablespace function of PostgreSQL to specify particular tables or databases to place them on NVMe-SSD volume, in addition to construction of the entire database cluster on the NVMe-SSD volume.</p>
<p>For example, you can create a new tablespace below, if NVMe-SSD is mounted at <code>/opt/nvme</code>.</p>
<pre><code>CREATE TABLESPACE my_nvme LOCATION '/opt/nvme';
</code></pre>

<p>In order to create a new table on the tablespace, specify the <code>TABLESPACE</code> option at the <code>CREATE TABLE</code> command below.</p>
<pre><code>CREATE TABLE my_table (...) TABLESPACE my_nvme;
</code></pre>

<p>Or, use <code>ALTER DATABASE</code> command as follows, to change the default tablespace of the database.
Note that tablespace of the existing tables are not changed in thie case.</p>
<pre><code>ALTER DATABASE my_database SET TABLESPACE my_nvme;
</code></pre>

<h2 id="distance-between-gpu-and-nvme-ssd">Distance between GPU and NVME-SSD</h2>
<p>On selection of server hardware and installation of GPU and NVME-SSD, hardware configuration needs to pay attention to the distance between devices, to pull out maximum performance of the device.</p>
<p><a href="https://docs.nvidia.com/cuda/gpudirect-rdma/">NVIDIA GPUDirect RDMA</a>, basis of the SSD-to-GPU Direct SQL mechanism, requires both of the edge devices of P2P DMA are connected on the same PCIe root complex. In the other words, unable to configure the P2P DMA traverses QPI between CPUs when NVME-SSD is attached on CPU1 and GPU is attached on CPU2 at dual socket system.</p>
<p>From standpoint of the performance, it is recommended to use dedicated PCIe-switch to connect both of the devices more than the PCIe controller built in CPU.</p>
<p>The photo below is a motherboard of HPC server. It has 8 of PCIe x16 slots, and each pair is linked to the other over the PCIe switch. The slots in the left-side of the photo are connected to CPU1, and right-side are connected to CPU2.</p>
<p>When a table on SSD-2 is scanned using SSD-to-GPU Direct SQL, the optimal GPU choice is GPU-2, and it may be able to use GPU1. However, we have to avoid to choose GPU-3 and GPU-4 due to the restriction of GPUDirect RDMA.</p>
<p><img alt="Motherboard of HPC Server" src="../img/pcie-hpc-server.png" /></p>
<p>PG-Strom calculate logical distances on any pairs of GPU and NVME-SSD using PCIe bus topology information of the system on startup time.
It is displayed at the start up log. Each NVME-SSD determines the preferable GPU based on the distance, for example, <code>GPU1</code> shall be used on scan of the <code>/dev/nvme2</code>.</p>
<pre><code>$ pg_ctl restart
     :
LOG:  GPU&lt;-&gt;SSD Distance Matrix
LOG:             GPU0     GPU1     GPU2
LOG:      nvme0  (   3)      7       7
LOG:      nvme5      7       7   (   3)
LOG:      nvme4      7       7   (   3)
LOG:      nvme2      7   (   3)      7
LOG:      nvme1  (   3)      7       7
LOG:      nvme3      7   (   3)      7
     :
</code></pre>

<p>Usually automatic configuration works well.
In case when NVME-over-Fabric(RDMA) is used, unable to identify the location of nvme device on the PCIe-bus, so you need to configure the logical distance between NVME-SSD and GPU manually.</p>
<p>The example below shows the configuration of <code>gpu2</code> for <code>nvme1</code>, and <code>gpu1</code> for <code>nvme2</code> and <code>nvme3</code>.
It shall be added to <code>postgresql.conf</code>. Please note than manual configuration takes priority than the automatic configuration.</p>
<pre><code>pg_strom.nvme_distance_map = nvme1:gpu2, nvme2:gpu1, nvme3:gpu1
</code></pre>

<h1 id="operations">Operations</h1>
<h2 id="controls-using-guc-parameters">Controls using GUC parameters</h2>
<p>There are two GPU parameters related to SSD-to-GPU Direct SQL Execution.</p>
<p>The first is <code>pg_strom.nvme_strom_enabled</code> that simply turn on/off the function of SSD-to-GPU Direct SQL Execution.
If <code>off</code>, SSD-to-GPU Direct SQL Execution should not be used regardless of the table size or physical location. Default is <code>on</code>.</p>
<p>The other one is <code>pg_strom.nvme_strom_threshold</code> which specifies the least table size to invoke SSD-to-GPU Direct SQL Execution.</p>
<p>PG-Strom will choose SSD-to-GPU Direct SQL Execution when target table is located on NVMe-SSD volume (or md-raid0 volume which consists of NVMe-SSD only), and the table size is larger than this parameter.
Its default is sum of the physical memory size and 1/3 of the <code>shared_buffers</code>. It means default configuration invokes SSD-to-GPU Direct SQL Execution only for the tables where we certainly cannot process them on memory.</p>
<p>Even if SSD-to-GPU Direct SQL Execution has advantages on a single table scan workload, usage of disk cache may work better on the second or later trial for the tables which are available to load onto the main memory.</p>
<p>On course, this assumption is not always right depending on the workload charasteristics.</p>
<h3 id="ensure-usage-of-ssd-to-gpu-direct-sql-execution">Ensure usage of SSD-to-GPU Direct SQL Execution</h3>
<p><code>EXPLAIN</code> command allows to ensure whether SSD-to-GPU Direct SQL Execution shall be used in the target query, or not.</p>
<p>In the example below, a scan on the <code>lineorder</code> table by <code>Custom Scan (GpuJoin)</code> shows <code>NVMe-Strom: enabled</code>. In this case, SSD-to-GPU Direct SQL Execution shall be used to read from the <code>lineorder</code> table.</p>
<pre><code># explain (costs off)
select sum(lo_revenue), d_year, p_brand1
from lineorder, date1, part, supplier
where lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and p_category = 'MFGR#12'
and s_region = 'AMERICA'
  group by d_year, p_brand1
  order by d_year, p_brand1;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 GroupAggregate
   Group Key: date1.d_year, part.p_brand1
   -&gt;  Sort
         Sort Key: date1.d_year, part.p_brand1
         -&gt;  Custom Scan (GpuPreAgg)
               Reduction: Local
               GPU Projection: pgstrom.psum((lo_revenue)::double precision), d_year, p_brand1
               Combined GpuJoin: enabled
               -&gt;  Custom Scan (GpuJoin) on lineorder
                     GPU Projection: date1.d_year, part.p_brand1, lineorder.lo_revenue
                     Outer Scan: lineorder
                     Depth 1: GpuHashJoin  (nrows 2406009600...97764190)
                              HashKeys: lineorder.lo_partkey
                              JoinQuals: (lineorder.lo_partkey = part.p_partkey)
                              KDS-Hash (size: 10.67MB)
                     Depth 2: GpuHashJoin  (nrows 97764190...18544060)
                              HashKeys: lineorder.lo_suppkey
                              JoinQuals: (lineorder.lo_suppkey = supplier.s_suppkey)
                              KDS-Hash (size: 131.59MB)
                     Depth 3: GpuHashJoin  (nrows 18544060...18544060)
                              HashKeys: lineorder.lo_orderdate
                              JoinQuals: (lineorder.lo_orderdate = date1.d_datekey)
                              KDS-Hash (size: 461.89KB)
                     NVMe-Strom: enabled
                     -&gt;  Custom Scan (GpuScan) on part
                           GPU Projection: p_brand1, p_partkey
                           GPU Filter: (p_category = 'MFGR#12'::bpchar)
                     -&gt;  Custom Scan (GpuScan) on supplier
                           GPU Projection: s_suppkey
                           GPU Filter: (s_region = 'AMERICA'::bpchar)
                     -&gt;  Seq Scan on date1
(31 rows)
</code></pre>

<h2 id="attension-for-visibility-map">Attension for visibility map</h2>
<p>Right now, GPU routines of PG-Strom cannot run MVCC visibility checks per row, because only host code has a special data structure for visibility checks. It also leads a problem.</p>
<p>We cannot know which row is visible, or invisible at the time when PG-Strom requires P2P DMA for NVMe-SSD, because contents of the storage blocks are not yet loaded to CPU/RAM, and MVCC related attributes are written with individual records. PostgreSQL had similar problem when it supports IndexOnlyScan.</p>
<p>To address the problem, PostgreSQL has an infrastructure of visibility map which is a bunch of flags to indicate whether any records in a particular data block are visible from all the transactions. If associated bit is set, we can know the associated block has no invisible records without reading the block itself.</p>
<p>SSD-to-GPU Direct SQL Execution utilizes this infrastructure. It checks the visibility map first, then only "all-visible" blocks are required to read with SSD-to-GPU P2P DMA.</p>
<p>VACUUM constructs visibility map, so you can enforce PostgreSQL to construct visibility map by explicit launch of VACUUM command.</p>
<pre><code>VACUUM ANALYZE linerorder;
</code></pre>
              
            </div>
          </div>
          <footer>
  
    <div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
      
        <a href="../arrow_fdw/" class="btn btn-neutral float-right" title="Apache Arrow">Next <span class="icon icon-circle-arrow-right"></span></a>
      
      
        <a href="../troubles/" class="btn btn-neutral" title="Trouble Shooting"><span class="icon icon-circle-arrow-left"></span> Previous</a>
      
    </div>
  

  <hr/>

  <div role="contentinfo">
    <!-- Copyright etc -->
    
  </div>

  Built with <a href="http://www.mkdocs.org">MkDocs</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
      
        </div>
      </div>

    </section>

  </div>

  <div class="rst-versions" role="note" style="cursor: pointer">
    <span class="rst-current-version" data-toggle="rst-current-version">
      
      
        <span><a href="../troubles/" style="color: #fcfcfc;">&laquo; Previous</a></span>
      
      
        <span style="margin-left: 15px"><a href="../arrow_fdw/" style="color: #fcfcfc">Next &raquo;</a></span>
      
    </span>
</div>
    <script>var base_url = '..';</script>
    <script src="../js/theme.js" defer></script>
      <script src="../search/main.js" defer></script>

</body>
</html>
